We will start with basic data visualization in R, focusing on ggplot2. Kieran Healy's Data Visualization: A Practical Introduction is a good resource for learning basic visualization. You can follow his book website and install his companion package for learning purposes.

After we are familiar with basic data visualization in R, we will switch to visualizing texts using these basic techniques. We don't cover other, fancier visualization tools; if you are interested in those, you can learn more by checking out shiny, plotly, etc.

There are a couple of books you should read: R for Data Science and the ggplot2 cookbook. For disclosure, some of the example code is adapted from R for Data Science.

DataViz Basics in R

We need to load a few packages first.

if (!requireNamespace("pacman", quietly = TRUE))
  install.packages("pacman")
library(pacman)
# packages we will use throughout
packages <- c("tidyverse", "tidytext", "haven")
p_load(packages, character.only = TRUE)

Tidy data

Let us load the DoCA data into R; the input is a CSV file.

nyt_doca <- read_csv("data_doca.csv")

We will also use the DoCA main dataset, which you can download here: https://web.stanford.edu/group/collectiveaction/cgi-bin/drupal/node/21.

We use the read_dta() function from haven to read the Stata file.

main_doca <- read_dta("final_data_v10.dta",encoding = "latin1")

Let us merge the two datasets using the key identifiers title and title_doca. We will use the tidyverse function left_join(); it is similar to Stata's merge command.

data <- main_doca %>% 
  mutate(title_doca=tolower(title)) %>% 
  left_join(nyt_doca %>% 
              mutate(title_doca=tolower(title_doca)),
            by="title_doca") %>% 
  filter(!is.na(text)) %>% 
  select(title,title_doca,text,everything())
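To see what left_join() does, here is a base-R sketch using merge(all.x = TRUE), which has the same keep-all-left-rows semantics (the toy tables are hypothetical, not DoCA data):

```r
# Hypothetical toy tables standing in for main_doca and nyt_doca
events   <- data.frame(title_doca = c("a", "b", "c"),
                       evyy       = c(1960, 1961, 1962))
articles <- data.frame(title_doca = c("a", "c"),
                       text       = c("text a", "text c"))

# Base-R analogue of left_join(events, articles, by = "title_doca"):
# every row of `events` is kept; unmatched rows get NA in `text`
joined <- merge(events, articles, by = "title_doca", all.x = TRUE)
joined
```

Event "b" has no matching article, so its text comes back NA; the filter(!is.na(text)) step above then drops such rows.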

knitr::kable(data[1:5,1:2],cap="DoCA with NYT article")
DoCA with NYT article
title | title_doca
ILLINOIS UNIT IN COURT | illinois unit in court
DESECRATION IN PHILADELPHIA | desecration in philadelphia
VICTORS AT DALLAS ACCUSE FOES OF DIRTY PLAY AND RACIAL SLURS | victors at dallas accuse foes of dirty play and racial slurs
CITY ACTION URGED TO COMBAT BIGOTS | city action urged to combat bigots
CITY ACTION URGED TO COMBAT BIGOTS | city action urged to combat bigots

Let us create a tidy dataset, keeping text, title_doca, eventid, event year, violence, participant size, what, purpose, and whysm. Here is the codebook.

tidy_data <- data %>% 
  select(eventid,evyy,title_doca,text,what,purpose,whysm,particex,viold)

Let us do some cleaning of the main news articles.

# remove punctuation, digits, and white space like \n, \t, etc.
tidy_data <- tidy_data %>% 
  mutate(tidy_text=tolower(text) %>% 
           str_replace_all("[:punct:]|\\s|\\d"," ") %>% 
           # strip the ProQuest header and the trailing 100 characters
           str_replace_all("^.*?proquest historical newspapers the new york times|.{100}$","") %>% 
           str_squish %>% 
           str_trim
         )
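The same cleaning steps can be sketched in base R on a made-up string (hypothetical text, not a real article), which shows what each pass does:

```r
raw <- "Protest, 1965!\n\tMarchers   gathered."
x <- tolower(raw)
# replace punctuation, digits, and any white space with spaces
x <- gsub("[[:punct:][:digit:][:space:]]", " ", x)
# collapse repeated spaces and trim the ends (what str_squish() does)
x <- trimws(gsub(" +", " ", x))
x
# [1] "protest marchers gathered"
```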

Understanding ggplot2

If we need to be explicit about where a function (or dataset) comes from, we’ll use the special form package::function(). For example, ggplot2::ggplot() tells you explicitly that we’re using the ggplot() function from the ggplot2 package.

We have to admit that ggplot2 is the most popular R graphing package in the social science community. If you don't know it, you should read R for Data Science and the ggplot2 cookbook.

Let us load the package if you have not loaded it already.

library(ggplot2)

With ggplot2, you begin a plot with the function ggplot(). ggplot() creates a coordinate system that you can add layers to. The first argument of ggplot() is the dataset to use in the graph. So ggplot(data = tidy_data) creates an empty graph.

You complete your graph by adding one or more layers to ggplot(). The function geom_point() adds a layer of points to your plot, which creates a scatterplot. ggplot2 comes with many geom functions that each add a different type of layer to a plot.

Each geom function in ggplot2 takes a mapping argument. This defines how variables in your dataset are mapped to visual properties. The mapping argument is always paired with aes(), and the x and y arguments of aes() specify which variables to map to the x and y axes. ggplot2 looks for the mapped variables in the data argument, in this case, tidy_data.

# let us say, we want to see the number of articles by year
# we need to compute the yearly number of articles first
# we pass tidy_data to ggplot and do a scatterplot
# we assign the ggplot object to variable P
tidy_data %>% 
  distinct(title_doca,.keep_all = T) %>% 
  filter(!is.na(evyy)) %>% 
  group_by(evyy) %>% 
  summarise(articles_n=n()) %>% 
  ggplot()+
  geom_point(aes(x=evyy,y=articles_n))->p
p

Obviously this is an ugly plot. Let us do some extra work to beautify it: change the x and y axis titles and add a caption.

p <- p +
  # Add titles, subtitles, caption, change x, y axis label
 labs(title = "Annual NYT Coverage of Protest in the U.S. 1960-1995",
      subtitle = "Based on a random sample (N=2000)",
      caption = "Data source: Dynamic of Collective Action and ProQuest",
      x = "Event Year",
      y = "Number of News Articles"
      )+
  # Format the title, subtitle, and caption
  theme(
    plot.title = element_text(
      color = "red", 
      size = 12, 
      face = "bold"
    ),
    plot.subtitle = element_text(color = "blue"),
    plot.caption = element_text(color = "blue", 
                              face = "italic"))+
  # set axis breaks and limits
  scale_x_continuous(breaks=seq(1960,1995,5), limits=c(1960,1995))+
  scale_y_continuous(breaks=seq(0,100,10), limits=c(0,100))
  
p

I don't like the background. Let us change the theme to black and white.

# use black white theme, you can use different themes
p <-  p+
  theme_bw()
p

How about adding a smooth line?

p <- p + 
  geom_smooth(aes(x=evyy,y=articles_n))
p

This new plot contains the same x variable, the same y variable, and both the line and the points describe the same data. But they are not identical: they use different visual objects to represent the data. In ggplot2 syntax, we say they use different geoms.

A geom is the geometrical object that a plot uses to represent data. People often describe plots by the type of geom the plot uses. For example, bar charts use bar geoms, line charts use line geoms, boxplots use boxplot geoms, and so on. Scatterplots break the trend; they use the point geom. As we see above, you can layer different geoms over the same data: the dots use the point geom, and the fitted curve uses the smooth geom, a smooth line fitted to the data.

Let us try a histogram.

tidy_data %>% 
  distinct(title_doca,.keep_all = T) %>% 
  filter(!is.na(evyy)) %>% 
  ggplot()+
  geom_histogram(aes(x=evyy),binwidth = 0.5)+
  theme_classic()->p1
p1

Let us try a bar chart.

tidy_data %>% 
  distinct(title_doca,.keep_all = T) %>% 
  filter(!is.na(evyy)) %>% 
  ggplot()+
  geom_bar(aes(x=evyy),width = 0.5)+
  theme_classic()->p2
p2

I am tired of graphing the number of articles by year. Let us try purpose instead. We only care about purposes with more than two articles.

p3 <- tidy_data %>% 
  mutate(purpose=tolower(purpose)) %>% 
  filter(purpose!="") %>% 
  group_by(purpose) %>% 
  summarise(purpose_n=n()) %>% 
  filter(purpose_n>2) %>% 
  ggplot(aes(x=purpose,y=purpose_n))+
  geom_bar(stat="identity")

p3  

Totally a mess. You can flip the x-y axes. Let us switch our x axis to the y axis.

p3 <- p3 +
  coord_flip()
p3

You can check the ggplot2 cookbook for more details.

Understanding ggplot2 supplement

There are a lot of ggplot2-related packages you can use to visualize your data, for instance gganimate, ggnet, ggdendro, ggthemes, ggpubr, plotly, patchwork, ggridges, ggmap, ggrepel, ggradar, ggcorrplot, and GGally.

Let us install them all. You should spend some time exploring these packages. If you cannot install some of them from CRAN, try devtools::install_github().

ggpackages <- c("gganimate", "ggdendro", "ggthemes", "ggpubr", "plotly", "patchwork", "ggridges", "ggmap", "ggrepel", "ggcorrplot", "GGally")
p_load(ggpackages, character.only = TRUE)
# ggnet and ggradar are not on CRAN; use devtools::install_github(), e.g.:
# devtools::install_github("ropensci/plotly")
# devtools::install_github("briatte/ggnet")

Let us try plotly

p4 <- ggplotly(p)
p4

TextViz Basics in R

We use tidytext with ggplot2 to do some basic text visualization.

tidytext::unnest_tokens() provides us with a function to tokenize words: unnest_tokens(tbl, output, input, token = "words", format = c("text", "man", "latex", "html", "xml"), to_lower = TRUE, drop = TRUE, collapse = NULL, ...)
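Under the hood, word tokenization amounts to lowercasing the text and splitting on non-word characters. A base-R sketch of the idea on a hypothetical sentence:

```r
doc <- "Students Marched downtown; police responded."
# lowercase, then split on anything that is not a letter or apostrophe
tokens <- strsplit(tolower(doc), "[^a-z']+")[[1]]
tokens <- tokens[tokens != ""]   # drop empty strings from leading delimiters
tokens
# "students" "marched" "downtown" "police" "responded"
```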

library(tidytext)
library(SnowballC)
# we need to process text data
token_data <- tidy_data %>% 
  # create a unique id
  mutate(doca_id=row_number()) %>% 
  # Let us tokenize tidy_data text field
  unnest_tokens(output = word,input = tidy_text,token="words") %>%
  # get rid of stop words
  anti_join(tidytext::get_stopwords("en",source="snowball"),by="word") %>% 
  # let us do some stemming
  mutate(word_stem = wordStem(word)) %>% 
  filter(word_stem!="")
# let us count the word
token_data %>%
    count(word_stem, sort = TRUE)
## # A tibble: 33,619 x 2
##    word_stem     n
##    <chr>     <int>
##  1 said      13616
##  2 new       10837
##  3 time       8331
##  4 york       8233
##  5 mr         7535
##  6 school     5263
##  7 student    5120
##  8 state      4678
##  9 citi       4302
## 10 polic      4257
## # … with 33,609 more rows

Let us make a document-term matrix first.

dtm_data <- token_data %>%
    count(doca_id, word_stem, sort = TRUE) %>%
  cast_dtm(doca_id, word_stem, n)
dtm_data
## <<DocumentTermMatrix (documents: 2477, terms: 33619)>>
## Non-/sparse entries: 550593/82723670
## Sparsity           : 99%
## Maximal term length: 44
## Weighting          : term frequency (tf)
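A document-term matrix is just a table of counts with one row per document and one column per term; cast_dtm() builds a sparse version from the (doc_id, word_stem, n) counts. A dense base-R sketch with table() on toy tokens (hypothetical data):

```r
# Hypothetical tokenized corpus: one row per (document, word) occurrence
toks <- data.frame(doc  = c(1, 1, 1, 2, 2),
                   word = c("police", "march", "police", "school", "march"))
dtm <- table(toks$doc, toks$word)
dtm  # doc 1: march=1, police=2, school=0; doc 2: march=1, police=0, school=1
```

Real corpora make this matrix huge and mostly zeros, which is why cast_dtm() stores it sparsely.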

What are the highest tf-idf words in our documents? Let us plot them

tfidf_data <- token_data %>%
    count(doca_id, word_stem, sort = TRUE) %>%
    bind_tf_idf(word_stem, doca_id, n) %>%
    arrange(-tf_idf) %>%
    group_by(doca_id) %>%
    top_n(10) %>%
    ungroup 
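bind_tf_idf() defines tf as a term's share of its document's tokens and idf as the natural log of (number of documents / number of documents containing the term). With the 2,477 documents in our DTM, we can check the idf values in the table that follows by hand:

```r
n_docs <- 2477      # documents in dtm_data above
log(n_docs / 1)     # stem appearing in exactly 1 document: idf of about 7.814803
log(n_docs / 7)     # stem appearing in 7 documents (e.g. "krishna"): about 5.868893
```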

knitr::kable(tfidf_data[1:10,],cap="DoCA with NYT article, TF-IDF")
DoCA with NYT article, TF-IDF
doca_id word_stem n tf idf tf_idf
405 beloit 4 0.0816327 7.814803 0.6379431
833 mastic 3 0.0714286 7.814803 0.5582002
1578 krishna 8 0.0898876 5.868893 0.5275410
1579 krishna 8 0.0898876 5.868893 0.5275410
881 fisk 5 0.0781250 6.716191 0.5247024
720 cheynei 4 0.0666667 7.814803 0.5209869
1752 arab 4 0.1176471 4.413606 0.5192478
1243 chees 6 0.0869565 5.617579 0.4884851
2268 bigotri 4 0.1000000 4.770281 0.4770281
1115 nader 5 0.0877193 5.329897 0.4675348

Replicating the fighting-words article metrics

knitr::include_graphics('fig1.png')
fighting words

Let us see whether violence influences media coverage of protest. The goal here is to compare the words in two corpora: news articles about protests with violence and news articles about protests without violence. We want to see which words are more likely to be associated with violence and which with nonviolence.

We use the following formula:

\[f_{kw}^{(V)}-f_{kw}^{(NV)}\] where \[f_{kw}^{(V)}=y_{kw}^{(V)}/n_k^{(V)}\]
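Note that the code below normalizes by a word's combined count across both corpora (tot_count) rather than by the corpus size n_k, so for a word with no missing viold codes, f^(V) and f^(NV) sum to one. Checking the first row of the metric table ("special") by hand:

```r
# Counts for "special" taken from the metric table below
tot <- 1499; V <- 217; NV <- 1279
fv  <- V  / tot    # 0.1447632
fnv <- NV / tot    # 0.8532355
fv - fnv           # -0.7084723; V + NV < tot here because some viold values are missing
```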

# first we need to compute these two metrics for each word in two corpora
metric_data <- token_data %>% 
  count(word_stem, name="tot_count",sort = TRUE) %>% 
  left_join(token_data %>% 
              filter(!is.na(viold)) %>% 
              mutate(viold=ifelse(viold==1,"V","NV")) %>% 
              count(viold,word_stem) %>% 
              pivot_wider(names_from=viold,values_from=n,values_fill=0),
            by="word_stem") %>% 
  mutate(
    fv=V/tot_count,
    fnv=NV/tot_count,
    fv_fnv=fv-fnv,
    weight=abs(fv_fnv)
    ) %>% 
    # drop top common words or rare words
  filter(tot_count<1500,tot_count>50)

knitr::kable(metric_data[1:10,],cap="DoCA with NYT article, Violence or not")  
DoCA with NYT article, Violence or not
word_stem tot_count NV V fv fnv fv_fnv weight
special 1499 1279 217 0.1447632 0.8532355 -0.7084723 0.7084723
colleg 1482 1388 94 0.0634278 0.9365722 -0.8731444 0.8731444
commun 1471 1257 210 0.1427600 0.8545207 -0.7117607 0.7117607
public 1443 1313 129 0.0893971 0.9099099 -0.8205128 0.8205128
women 1438 1350 88 0.0611961 0.9388039 -0.8776078 0.8776078
continu 1432 1200 230 0.1606145 0.8379888 -0.6773743 0.6773743
dr 1386 1171 213 0.1536797 0.8448773 -0.6911977 0.6911977
without 1375 1179 196 0.1425455 0.8574545 -0.7149091 0.7149091
report 1374 1071 302 0.2197962 0.7794760 -0.5596798 0.5596798
first 1354 1154 200 0.1477105 0.8522895 -0.7045790 0.7045790

Let us get the top 10 violent words and the top 10 nonviolent words.

top50_words <- metric_data %>% 
  top_n(10,fv_fnv) %>% 
  bind_rows(metric_data %>% 
  top_n(-10,fv_fnv))

Let us replicate the graph.

metric_data %>% 
  filter(!is.na(fv_fnv)) %>% 
  ggplot(aes(x=tot_count,
             y=fv_fnv))+
  geom_point()+
  theme_bw()+
  theme(legend.position = "none")

metric_data %>% 
  filter(!is.na(fv_fnv)) %>% 
  left_join(top50_words %>% transmute(word_stem,top50_words=word_stem),by="word_stem") %>% 
  ggplot(aes(x=tot_count,
             y=fv_fnv))+
  geom_point()+
  ggrepel::geom_label_repel(aes(label=top50_words))+
  theme_bw()+
  theme(legend.position = "none")

Lab 5 Problem Set

We will continue to use the NYT DoCA 2000 news articles as our dataset for text viz. You need to train a topic model using stm.

Then you should visualize the topics using some of the extra packages we provided in the stm lab tutorial.

Send me a screenshot before Tuesday 6 PM.

The End